Abstract

In this paper we present methods for attacking and defending k-gram statistical analysis techniques
that are used, for example, in network traffic analysis and covert channel detection. The main new result
is our demonstration of how to use a behavior's or process's k-order statistics to build a stochastic process
that has those same k-order stationary statistics but possesses different, deliberately designed, (k + 1)-order
statistics if desired. Such a model realizes a "complexification" of the process or behavior which
a defender can use to monitor whether an attacker is shaping the behavior. By deliberately introducing
designed (k + 1)-order behaviors, the defender can check to see if those behaviors are present in the
data. We also develop constructs for source codes that respect the k-order statistics of a process while
encoding covert information. One fundamental consequence of these results is that certain types of
behavior analysis techniques come down to an arms race in the sense that the advantage goes to the
party that has more computing resources applied to the problem.

Index Terms

Covert Channels, Exfiltration, Probabilistic Automata, Cognitive Attack, Anomaly Detection. 

I. Introduction 

Computer security researchers have been investigating statistical behavioral modeling tech- 
niques as a means for determining whether a machine, a network or data packet contents are 
behaving "normally" or not. These are so-called behavior analysis techniques and implicitly 
model stochastic processes at some level of fidelity.

Consider for example, the problem of detecting covert channels. Some existing approaches 
assume that an adversary has installed an exfiltrating agent, or Trojan, which operates by encoding 
data in a way that introduces detectable regularities in some network traffic statistics. For 
example, Giani et al. [1] and Cabuk et al. [2] estimate certain first order statistics of packet
inter-arrival delays in order to determine whether a time covert channel is being used. Dainotti 
et al. [3] learn a Hidden Markov Process [4], [5], [6], [7], [8] using both packet inter-arrival delays
and packet sizes to detect traffic anomalies. Other techniques are based on various analyses of 
n-gram statistics [9]. In fact, some have called techniques that match n-gram statistics "mimicry
attacks" and while techniques have been developed for detecting certain simple types of mimicry, 
techniques for building mimicry attacks as described in the present paper appear to be novel [9].

General discussions of covert channels and their taxonomies, existence and modeling have been 
published [10], [11], [12], [13], [14]. The design, implementation and experimental evaluation
of several specific covert channel attacks in real systems is of specific interest [14]. That work
presents threat models, achievable bit rates, noise properties and channel capacities for covert 
channels. 

The existence and successful use of a covert channel is based on the assumption that the 
covert channel code does not perturb the measured statistical properties of behavior so that, 
over time, a covert transmission does not introduce discernible patterns which are different than 
expected, at least with respect to what is measured. In this paper we assume the ability to learn 
a k-gram type model of "normal behavior." This is simply done by counting the occurrences
of k-grams and then normalizing to produce frequencies or probabilities. It is important to note
that researchers often talk about entropy as a channel statistic [10], [15], but entropy is typically
calculated from k-order statistics, so our methods for preserving k-order statistics, which also preserve
all lower order statistics, will preserve the entropy as well.

We present a technique for encoding messages that respects these k-order statistics. Both
attacker and defender can use this coding technique. The attacker could exfiltrate coded
information while the defender could embed an encoded reference message or carrier to detect
manipulations of the channel by an adversary attempting covert communications. That is, for
any order k, an attacker or a defender can encode covert messages while otherwise respecting
the k-order statistics of the traffic.

Also, we show how a defender can create a process of order k + 1 which has the same
k-order statistics but specifically designed (k + 1)-order statistics that the defender can easily
monitor to see if the (k + 1)-order statistics have been changed. Researchers have recently started
to develop systematic taxonomies and examples of attacks against statistical machine learning
techniques [16]. In that spirit, the present work develops specific techniques to both attack and
defend using certain statistical approaches.

We discuss these methods in the context of behaviors that have a finite set of observable 
symbols (the alphabet). Interpacket arrival times, packet sizes, header fields, packet contents and 
so on are examples of such observables if quantized into a finite number of bins. Our approach 
models the observables as a stationary stochastic process X [17]. After estimating the k-order
statistics, we build a Probabilistic Deterministic Finite Automaton (PDFA) model [19], [20],
[21] that realizes the k-order statistics.

Using that PDFA, we show that: 

1) an adversary can encode messages covertly while respecting the k-order statistics;

2) the defender can encode reference messages or a carrier while respecting the k-order
statistics; and

3) the defender can build a more complex process which has the same k-order statistics but
possesses deliberately designed (k + 1)-order statistics.

Examples of such covert channels in network traffic include, but are not restricted to: 

• Timing Channels: The observable symbols are the inter-packet time delays, appropriately 
quantized; 

• Size Channels: The observable symbols are quantized sizes of the packets; 

• Header Channels: The observable symbols are various header fields in TCP/IP packets
which can be manipulated by the transmitting entity without violating protocol semantics.
Several such fields are known to exist [22].

It is important to clarify right away what we mean by a k-order statistic and a k-gram.
Suppose we have an alphabet consisting of $\{\alpha, \beta, \gamma\}$ and we observe a sequence comprised of
that alphabet, say

$$\alpha\alpha\alpha\beta\alpha\beta\gamma\gamma.$$

The first order statistics are [1/2 1/4 1/4], indicating that 1/2 of the symbols are $\alpha$'s, 1/4 are $\beta$'s
and 1/4 are $\gamma$'s. The 1-grams are merely $\{\alpha, \beta, \gamma\}$.

The 2-grams observed in this sequence are $\alpha\alpha$, $\alpha\alpha$, $\alpha\beta$, $\beta\alpha$, $\alpha\beta$, $\beta\gamma$, $\gamma\gamma$ and the 2-order
statistics for the 9 possible 2-grams

$$\alpha\alpha,\ \alpha\beta,\ \alpha\gamma,\ \beta\alpha,\ \beta\beta,\ \beta\gamma,\ \gamma\alpha,\ \gamma\beta,\ \gamma\gamma$$

are respectively

[2/7 2/7 0 1/7 0 1/7 0 0 1/7].

That is, our k-grams are obtained by moving a sliding window of width k across the data one
symbol at a time. This is not to be confused with moving that window across the data sequence
k symbols at a time.
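
The sliding-window count described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kgram_statistics` and the use of the letters a, b, c as stand-ins for $\alpha$, $\beta$, $\gamma$ are assumptions made for the example, not part of the original method description.

```python
from collections import Counter
from fractions import Fraction


def kgram_statistics(sequence, k):
    """Return the relative frequency of every k-gram seen in `sequence`.

    The window of width k advances one symbol at a time, so a sequence of
    length n contributes n - k + 1 overlapping k-grams.
    """
    n = len(sequence)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")
    windows = [tuple(sequence[i:i + k]) for i in range(n - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {gram: Fraction(c, total) for gram, c in counts.items()}


# 'a', 'b', 'c' stand in for alpha, beta, gamma (an illustrative encoding).
example = "aaababcc"
first_order = kgram_statistics(example, 1)
second_order = kgram_statistics(example, 2)
```

Unobserved 2-grams are simply absent from the returned dictionary, which corresponds to the zero entries in the 2-order statistics vector.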

The following discussion shows how a timing covert channel can be constructed based on a 
beacon and argues that a naive encoding of covert messages based on packet inter-arrival times
produces a clearly detectable distortion of the 1-order statistics of those time intervals in network 
traffic [2], [11].

Figure 1 describes the setup. Machine A sends a regularly timed beacon to machine D. (Such a
beacon can be a time server request or a stay alive beacon for instance.) The inter-packet delays 
seen at machine B are not regular due to internal routing delays in the LAN. (These statistics 
were actually measured from a regularly timed beacon traveling several hops.) An intruder was 
able to compromise and control machine B which is inside the local network and a relay for the 
traffic between A and D. (B could be a proxy server, border router or other device for example.) 
Assume we set up a machine C outside the internal network perimeter to check for timing covert 
channels. C has seen a certain distribution of inter-packet delays coming from A going to D. 

[Figure 1 diagram. Internal network: (1) the beacon is sent at regular time intervals (e.g., stay alive, time server); (2) the beacon is received at the border router with irregular inter-packet delays due to internal LAN congestion. A compromised border router can embed a covert message using the same 1st-order statistics as A's normal beacon statistics as seen at B. External network: (3) the beacon is received outside the LAN with similar irregular inter-packet delays; (4) at the destination of the beacon, its existence and content are relevant but its time delays are irrelevant. Monitoring outside the network will show that the 1st-order statistics have not changed.]


Fig. 1. An intruder controlling machine B inside a local network exfiltrates data coded in inter-packet delays received by 
machine C en route to machine D. Monitoring outside the intranet LAN will show the same first order inter-packet delays with 
or without the covert channel as constructed in this paper. 



In this paper, we show how machine B can encode covert messages in the inter-packet delays 
in such a way that the first order statistics as seen by C remain unchanged from the original 
distribution. Conversely, we can deliberately defend against such channels by encoding messages 
so that any manipulation of the delays will be detectable on the outside at machine C, because 
the covert message will not be received at C. 

Figure 2 shows the number of packets received with a given delay in two scenarios. The
horizontal axis reports the inter-arrival time in seconds, and the vertical axis the number of
packets received with those delays. The left graph of Figure 2 shows the observed inter-packet
delays resulting from a regularly timed beacon traversing multiple hops in a LAN. The
right graph depicts a naive covert timing channel using two time intervals to encode a
message. It is evident from the data that the naive covert communication in the right graph can
be easily detected if the 1-order statistics of normal traffic have been measured and are those on
the left. However, the 1-order distribution on the left can be generated either by normal traffic,
as it was obtained, or by a covert channel, as we will show.

Fig. 2. First order statistics of inter-packet delays of normal traffic (left) and a poorly designed covert channel using two
delays only (right). (Packets were deliberately routed to hop several times between source and destination.) This paper develops
techniques for creating covert channels that have the same statistics as the ones depicted on the left, even if higher order statistics
are measured.

In this contribution, we develop a more sophisticated approach than the naive one insofar
as we also consider statistics of arbitrarily higher order, i.e. $k > 1$, and our results effectively
show that, for any $k$, attackers and defenders both have technical approaches for, respectively,
attacking or defending a k-order behavior with respect to covert communications. Consequently,
the situation is an arms race in the sense that whichever side has the ability to learn the highest
order statistics wins.



A. Outline of the Paper 

In Section II we present an illustrative example. In Section III we describe our method and
show how to manipulate a behavior's statistics with Probabilistic Automata. In Section IV we
provide a numerical example. In Section V we show how to use Probabilistic Automata to
build a channel code that respects the statistics of traffic up to some predecided order. Finally,
Section VI contains some conclusions and directions for future work.

II. A Simple Illustrative Example

To illustrate these concepts, consider a simple binary observable with values 0 and 1. It is
assumed that these observables are irrelevant to the normal operation of the underlying system
and its semantics. For example, the observables could be quantized inter-arrival times or unused
packet header fields.

Assume that the 1-order statistics of these observables are $r_0 > 0$ and $r_1 > 0$ with $r_0 + r_1 = 1$.
This means that the relative frequencies of 0's and 1's as observed in the behavior are $r_0$ and $r_1$
respectively. Now suppose an attacker has estimated these probabilities and seeks to exfiltrate
messages while respecting these probabilities. This is possible and, later in this paper, we review
standard source coding ideas that allow the attacker to create such codes efficiently.

In fact, if the messages to be sent are binary and Bernoulli with $p = 0.5$ (such as for encrypted
and/or compressed messages), then there are codes that use $1/H(r_0) = 1/H(r_1)$ bits in the covert
channel per original message symbol where $H(x) = -(x \log_2 x + (1 - x)\log_2(1 - x))$ is the
entropy function. We show how to construct such codes to respect k-order statistics as well.
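
As a rough numerical illustration of the expansion factor $1/H(r_0)$, the following sketch evaluates the binary entropy function. The value $r_0 = 0.7$ and the function names are assumptions chosen for the example; they are not measurements from the traffic discussed in this paper.

```python
import math


def binary_entropy(x):
    """H(x) = -(x*log2(x) + (1-x)*log2(1-x)), with H(0) = H(1) = 0."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a probability")
    if x in (0.0, 1.0):
        return 0.0
    return -(x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x))


def channel_symbols_per_message_bit(r0):
    """Covert-channel symbols needed per fair message bit: 1 / H(r0)."""
    h = binary_entropy(r0)
    if h == 0.0:
        raise ValueError("a deterministic channel cannot carry information")
    return 1.0 / h


r0 = 0.7  # illustrative value, not taken from measured traffic
expansion = channel_symbols_per_message_bit(r0)
```

The factor is at least 1 and equals 1 only when $r_0 = 0.5$, which reflects the fact that respecting a biased distribution costs extra channel symbols.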

By the same token, the defender can encode a reference signal, also respecting the first order 
statistics as above, which can be decoded and verified at the receiving end. 

Note that no specific second order statistics $r_{00}$, $r_{01}$, $r_{10}$ and $r_{11}$ have been modeled so far, but
if the process is modeled by a Bernoulli process with $p = r_1$ then the second order statistics
would be $r_i \cdot r_j = r_{ij} = \mathrm{Prob}(ij) = \mathrm{Prob}(ji)$ by independence.

However, the defender can construct a second order process with second order statistics
$r_{00}$, $r_{01}$, $r_{10}$ and $r_{11}$ for which $r_{ij} \neq r_i \cdot r_j$ while satisfying the required first order statistics,
namely $r_0$ and $r_1$. If the attacker exploits the channel through a purely first order process, the
constructed second order statistics will likely not be observed by the defender who could
then conclude that the traffic is being shaped by an adversary.

To illustrate this 2-order construction, consider an automaton with two states, $Q = \{0, 1\}$,
corresponding to the two 1-grams of observables. Let $X$ be the matrix of the transition probabilities

$$X = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}.$$

We seek PDFAs that have the stationary distribution $\pi = [r_0 \;\; 1 - r_0] = [1 - r_1 \;\; r_1] = [1 - r \;\; r]$.
Specifically, we seek $X$ that satisfies

$$[1 - r \;\; r] \cdot X = \pi X = \pi = [1 - r \;\; r]$$

with $X$ being a stochastic matrix (non-negative with row sums equal to 1). The class of PDFAs
that are 1-order equivalent to the given process is therefore determined by a set of linear equality
and inequality constraints as follows:

$$\begin{aligned}
(1 - r) \cdot p_{00} + r \cdot p_{10} &= 1 - r \\
(1 - r) \cdot p_{01} + r \cdot p_{11} &= r \\
p_{00} + p_{01} &= 1 \\
p_{10} + p_{11} &= 1 \\
p_{ij} &\geq 0.
\end{aligned}$$

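The four equations are linearly dependent and we can reduce them to the three equations and
constraints

$$\begin{aligned}
r \cdot p_{11} - (1 - r) \cdot p_{00} &= 2r - 1 \\
p_{00} + p_{01} &= 1 \\
p_{10} + p_{11} &= 1 \\
0 \leq p_{00},\, p_{11} &\leq 1.
\end{aligned}$$

There are an infinite number of solutions according to

$$p_{11} = p_{00}\,\frac{1 - r}{r} + \frac{2r - 1}{r}, \qquad 0 \leq p_{00},\, p_{11} \leq 1. \tag{1}$$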

For example, if $r = 0.3$, $1 - r = 0.7$ then the constraints become:

$$p_{11} = \frac{0.7\,p_{00} - 0.4}{0.3}, \qquad 0 \leq p_{00},\, p_{11} \leq 1$$

so letting $p_{00} = 0.8$ we get $p_{11} = 0.16/0.3 \approx 0.533$ and therefore $p_{01} = 0.2$ and $p_{10} \approx 0.467$. With
$\pi = [\pi_1 \;\; \pi_2] = [0.7 \;\; 0.3]$, this yields 2-order statistics of:

$$\begin{aligned}
r_{00} &= p_{00}\,\pi_1 = 0.8 \cdot 0.7 = 0.56 \\
r_{01} &= p_{01}\,\pi_1 = 0.2 \cdot 0.7 = 0.14 \\
r_{10} &= p_{10}\,\pi_2 \approx 0.467 \cdot 0.3 = 0.14 \\
r_{11} &= p_{11}\,\pi_2 \approx 0.533 \cdot 0.3 = 0.16.
\end{aligned}$$

Notice that $r_{01} = r_{10} = 0.14$, $r_{01} + r_{00} = r_0 = \pi_1 = 0.7$ and $r_{01} + r_{11} = r_1 = \pi_2 = 0.3$ as
required.
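
The two-state example can be checked numerically. The sketch below picks $p_{00}$, derives $p_{11}$ from the relation $r\,p_{11} - (1 - r)\,p_{00} = 2r - 1$, and evaluates the resulting 2-order statistics; the function names are hypothetical and $0 < r < 1$ is assumed.

```python
def two_state_chain(r, p00):
    """Transition matrix [[p00, p01], [p10, p11]] with stationary law [1-r, r].

    p11 follows from r*p11 - (1-r)*p00 = 2r - 1; a ValueError is raised when
    the chosen p00 would push p11 outside [0, 1]. Assumes 0 < r < 1.
    """
    p11 = (p00 * (1.0 - r) + 2.0 * r - 1.0) / r
    if not (0.0 <= p00 <= 1.0 and 0.0 <= p11 <= 1.0):
        raise ValueError("no stochastic matrix for this (r, p00)")
    return [[p00, 1.0 - p00], [1.0 - p11, p11]]


def second_order_statistics(pi, X):
    """r_ij = pi_i * p_ij: probability of seeing symbol i followed by j."""
    return [[pi[i] * X[i][j] for j in range(2)] for i in range(2)]


r = 0.3
pi = [1.0 - r, r]
X = two_state_chain(r, p00=0.8)
r2 = second_order_statistics(pi, X)

# Stationarity check: pi * X should reproduce pi.
pi_next = [sum(pi[i] * X[i][j] for i in range(2)) for j in range(2)]
```

For comparison, a Bernoulli process with the same first order statistics would give $r_{00} = 0.7 \cdot 0.7 = 0.49$ rather than 0.56, which is the kind of difference a defender could monitor.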

Another, equivalent way to derive these relations is to note that there are two trivial solutions
for $X$, namely $X_1 = I_2$ (the 2 by 2 identity matrix) and $X_2 = \mathbf{1} \cdot \pi$ where $\mathbf{1} = [1 \;\; 1]^T$ is the
column vector whose entries are all 1's. These two solutions are always different. Moreover, we
can see that any convex combination $pX_1 + (1 - p)X_2$ for $0 \leq p \leq 1$ is also a solution to all
the constraints and in fact yields the same class of solutions as above.

The point of this example is that we can shape the second order statistics of the observables 
without changing the first order statistics. In particular, multiple choices for $p_{00}$ (and so for $r_{00}$)
are possible, all of which lead to the same 1-order statistics. A defender can shape the second 
order statistics so that if an attacker only obeys the first order statistics, the defender can detect 
that the expected second order statistics are wrong. 

Note that the second order process in this example satisfies an additional constraint: the
marginal distributions must agree with the first order process, so that $r_{01} + r_{11} = r_1$ and so
on. Moreover, $r_{01} = r_{10}$ must be true as well (this is a symmetry which arises because the
numbers of 0 to 1 and 1 to 0 transitions in the observed sequence must be equal). For higher
order processes, the construction involves identifying and dealing with additional constraints and 
finding realizations which satisfy them. These generalizations to higher orders are one of the 
main contributions of this paper. 

To apply this construction to the empirical data shown in Figure 2, normalize the counts
into frequencies or probabilities by dividing by the total packet count. This yields a vector of 
probabilities: 

R = [ 0.0029 0.0144 0.0734 0.1453 0.3094 0.1295 0.1151 0.1079 0.1007 0.0014 ] (2) 
where the coordinates 1 through 13 correspond to delays of 0.01 through 0.13 in increments of
0.01. 

We seek to construct a Markov Chain whose states correspond to observable inter-packet 
delays and whose transition probabilities, P, describe the probability that one delay follows 
another. As explained above, P must satisfy two matrix equations (capturing the facts that R is 
a stationary vector for P and that P is row stochastic) 

$$R P = R \quad \text{and} \quad P \mathbf{1} = \mathbf{1}$$

where 1 is the column vector of all ones. Moreover, the entries of P are all non-negative. 
In this simple case, there are two solutions which are simple to identify, namely

$$P_B = \mathbf{1} \cdot R \quad \text{and} \quad P_D = I \tag{3}$$

where $I$ is the 13 by 13 identity matrix. The reader can easily check that both these matrices
satisfy the two required matrix equations. This construct is simple for 1-grams but becomes
more complex for general k-grams as shown below.

Moreover, for any $0 \leq \alpha \leq 1$, $P_\alpha = \alpha P_B + (1 - \alpha) P_D$ is also a solution. $P_B$ defines
a Bernoulli process, whereas $P_D$ describes a completely disconnected Markov Chain with an infinite
number of stationary distributions. $P_\alpha$ defines a Markov Chain that is irreducible, aperiodic and not
a Bernoulli process for any $0 < \alpha < 1$. Therefore, $P_\alpha$ can be used by a defender to create
specific second order statistics which an attacker would have to first model and then respect.
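
The mixture $P_\alpha$ is easy to build and check for any probability vector. The sketch below uses a small hypothetical three-symbol vector $R$ (not the measured delay histogram) and $\alpha = 0.5$, both chosen only for illustration.

```python
def mixed_chain(R, alpha):
    """P_alpha = alpha * (1 R) + (1 - alpha) * I for a probability vector R."""
    n = len(R)
    return [[alpha * R[j] + (1.0 - alpha) * (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def left_multiply(R, P):
    """Row vector times matrix: (R P)_j = sum_i R_i P_ij."""
    return [sum(R[i] * P[i][j] for i in range(len(R))) for j in range(len(R))]


R = [0.2, 0.5, 0.3]           # hypothetical first-order statistics
P_half = mixed_chain(R, 0.5)  # alpha = 0.5 chosen only for illustration

stationary_check = left_multiply(R, P_half)
row_sums = [sum(row) for row in P_half]
# Second-order statistic of the pair (0, 0) under P_half versus Bernoulli.
pair_00_shaped = R[0] * P_half[0][0]
pair_00_bernoulli = R[0] * R[0]
```

Here the shaped chain repeats symbol 0 with probability 0.12 per position, compared with 0.04 for the Bernoulli process, while both share the same first order statistics $R$.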

III. Constructing the Automata 

In this section, we show how to construct automata that can reproduce observed statistics 
computed from data. 

Let $\Sigma = \{a, b, c, \ldots\}$ be the finite observable alphabet and $\sigma = |\Sigma| < \infty$ be the number of
observables. We are assuming that we have sequences of observables from which we compute
the relative frequencies of k-grams ($k \geq 1$):

$$0 \leq R(x) \leq 1, \qquad \sum_{x \in \Sigma^k} R(x) = 1.$$

Here $\Sigma^k$ is the set of k-grams; that is, the set of all possible sequences of length $k$ drawn from
the alphabet $\Sigma$.

Roughly speaking, if $s_0 s_1 \ldots s_{n-1} = S_{0:n-1}$ is an observed data sequence of length $n > k$, $R(x)$
is approximated by the number of occurrences of the substring $x$ in $S_{0:n-1}$ divided by the total
number of substrings of length $k$ in $S_{0:n-1}$, namely $n - k + 1$. The set of $R(x)$'s is precisely
what we mean by the k-order statistics of the observations.

These statistics must satisfy certain regularity conditions required by the proposed construction
so some care must be taken in their computation. Specifically, the identity

$$\sum_{a \in \Sigma} R(ay) = \sum_{b \in \Sigma} R(yb) = R(y)$$

should hold for every $y \in \Sigma^{k-1}$. This can be accomplished by appending $s_0 s_1 \ldots s_{k-2}$ to $S_{0:n-1}$
as a suffix, effectively creating a periodic string, and counting occurrences in the periodic string.

Moreover, this can be repeated for every $1 \leq j \leq k$ by using a circular buffer appending
$s_0 s_1 \ldots s_{j-2}$. All marginal distributions

$$\sum_{w} R(wy) = \sum_{w} R(yw) = R(y)$$

will then hold for all $y \in \Sigma^j$. (Details are left to the reader.)
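
A minimal sketch of the circular-buffer counting follows, together with a check of the marginalization identity for one choice of $k$. The function names and the binary test string are hypothetical; exact rational arithmetic is used so that the identity can be tested with equality.

```python
from collections import Counter
from fractions import Fraction


def circular_kgram_statistics(sequence, k):
    """k-gram frequencies with the sequence treated as a circular buffer.

    Appending the first k - 1 symbols as a suffix yields exactly n windows,
    one starting at every position of the original sequence.
    """
    n = len(sequence)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")
    padded = list(sequence) + list(sequence[:k - 1])
    counts = Counter(tuple(padded[i:i + k]) for i in range(n))
    return {gram: Fraction(c, n) for gram, c in counts.items()}


def marginals_agree(sequence, k):
    """Check sum_a R(ay) == sum_b R(yb) == R(y) for every (k-1)-gram y (k >= 2)."""
    R_k = circular_kgram_statistics(sequence, k)
    R_km1 = circular_kgram_statistics(sequence, k - 1)
    prefix_sum, suffix_sum = Counter(), Counter()
    for gram, freq in R_k.items():
        suffix_sum[gram[1:]] += freq   # sum over the leading symbol a
        prefix_sum[gram[:-1]] += freq  # sum over the trailing symbol b
    return all(prefix_sum[y] == suffix_sum[y] == R_km1[y] for y in R_km1)


data = "0010001010000100"  # hypothetical binary observation string
consistent = marginals_agree(data, 3)
```

Without the circular padding, the first and last windows would break the identity by one count, which is why the periodic extension is used.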

We will now construct a special type of Markov Chain in which the elements of $\Sigma^k$ are the states and the
semantics of the k-grams are preserved so that if $x = ay \in \Sigma^k$ is an observed k-gram, then
$P(ay, yb)$ is the probability of transitioning to state $yb$ where both $a, b \in \Sigma$. Such transitions
are the only ones possible in the Markov Chain k-gram model. Such models are called $k^{th}$-order
Markov Models, k Markov Chains or k-gram models by different authors [19], [24].

Let $\pi$ be the vector of measured k-gram statistics, $R(x)$, and let $P$ be the desired Markov
Chain transition probabilities:

$$P = (P(x, x'))$$

where the entries of both $\pi$ and $P$ are indexed by $x, x' \in \Sigma^k$.

The stationary probabilities of the desired Markov Chain are precisely $\pi$ when the equation
$\pi P = \pi$ is satisfied. This matrix equation consists of $\sigma^k$ equations and the stochasticity
requirement on $P$ is another $\sigma^k$ equations resulting in the following $2\sigma^k$ equations overall:

$$\sum_{x \in \Sigma^k} P(x, x') R(x) = R(x'), \quad \forall x' \in \Sigma^k \quad \text{(stationary probability conditions)} \tag{4}$$

$$\sum_{x' \in \Sigma^k} P(x, x') = 1, \quad \forall x \in \Sigma^k \quad \text{(probability requirements)} \tag{5}$$

where $P(x, x') \geq 0$ as well.

Because of the relationship between k-grams and the Markov Chain that we are seeking to
construct, we can only have $P(x, x') \neq 0$ when $x = ay$ and $x' = yb$ for some $a, b \in \Sigma$ and
$y \in \Sigma^{k-1}$. That is, $y$ is the suffix of the state $x = ay$ and we can only transition to states $x' = yb$
which have $y$ as a prefix and some suffix $b \in \Sigma$. Accordingly, for every $y \in \Sigma^{k-1}$, we have the
$2\sigma$ equations

$$\sum_{a \in \Sigma} P(ay, yb) R(ay) = R(yb), \quad \forall b \in \Sigma, \tag{6}$$

$$\sum_{b \in \Sigma} P(ay, yb) = 1, \quad \forall a \in \Sigma, \tag{7}$$

$$P(ay, yb) \geq 0 \tag{8}$$

which are completely decoupled from the equations corresponding to $(k - 1)$-grams other than
$y$. Accordingly, we can solve each system independently.

Noting that the k-gram statistics, $R(x)$, satisfy the marginalization relations

$$\sum_{a \in \Sigma} R(ay) = \sum_{b \in \Sigma} R(yb) = R(y), \quad \forall y \in \Sigma^{k-1},$$

summing over $b$ in the equations (6), we get

$$\sum_{b} \sum_{a} P(ay, yb) R(ay) = \sum_{a} \sum_{b} P(ay, yb) R(ay) = \sum_{a} R(ay) = \sum_{b} R(yb) = R(y)$$

which is an identity not involving the unknown $P(ay, yb)$.

Accordingly, there are no more than $2\sigma - 1$ linearly independent equations in (6) and (7). In fact, if
we define $pre(y)$ to be the number of nonzero $R(ay)$ and $post(y)$ to be the number of nonzero
$R(yb)$, there are no more than $pre(y) \cdot post(y)$ unknown probabilities, $P(ay, yb)$, and no
more than $pre(y) + post(y) - 1$ independent equations altogether.

A. The Standard Solution 

One solution to the equations, which we call the Standard Solution, is $P(ay, yb) = R(yb)/R(y)$
because then

$$\sum_{a \in \Sigma} P(ay, yb) R(ay) = \sum_{a \in \Sigma} R(ay) R(yb)/R(y) = \frac{R(yb)}{R(y)} \sum_{a \in \Sigma} R(ay) = R(yb)$$

and

$$\sum_{b \in \Sigma} P(ay, yb) = \sum_{b \in \Sigma} R(yb)/R(y) = R(y)/R(y) = 1.$$

This specific solution has $pre(y) \cdot post(y)$ nonzero probabilities, $P(ay, yb)$, for the substring
$y \in \Sigma^{k-1}$ by construction.
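
The Standard Solution can be computed directly from circular k-gram counts. The following sketch builds it for $k = 2$ on a hypothetical binary string and accumulates the left-hand sides of the stationarity and stochasticity conditions so they can be compared with $R$ and with 1; the helper names are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def circular_counts(sequence, k):
    """Count k-grams with the sequence wrapped around as a circular buffer."""
    n = len(sequence)
    padded = list(sequence) + list(sequence[:max(k - 1, 0)])
    return Counter(tuple(padded[i:i + k]) for i in range(n))


def standard_solution(sequence, k):
    """Return (R, P): k-gram frequencies and P[(ay, yb)] = R(yb) / R(y)."""
    n = len(sequence)
    R = {x: Fraction(c, n) for x, c in circular_counts(sequence, k).items()}
    R_y = {y: Fraction(c, n)
           for y, c in circular_counts(sequence, k - 1).items()}
    P = {}
    for x in R:                 # x = a y, an observed k-gram
        y = x[1:]
        for x_next in R:        # x_next = y b
            if x_next[:-1] == y:
                P[(x, x_next)] = R[x_next] / R_y[y]
    return R, P


data = "0010001010000100"  # hypothetical binary observation string
R, P = standard_solution(data, 2)

# Stationarity: sum_a P(ay, yb) R(ay) should give back R(yb).
flow_in = Counter()
for (x, x_next), prob in P.items():
    flow_in[x_next] += prob * R[x]
# Stochasticity: outgoing probabilities of every state sum to 1.
row_sums = Counter()
for (x, _), prob in P.items():
    row_sums[x] += prob
```

Only observed k-grams appear as states here, so transitions into unobserved k-grams are implicitly zero.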

By construction, this Markov Chain is irreducible because we have constructed the transition
probabilities from a circular buffer so that there is a nonzero probability of going from any
state with nonzero probability, namely $R(x)$, to any other state with nonzero probability. If
additionally the constructed Standard Solution Markov Chain is aperiodic, its unique stationary
distribution is precisely $R(x)$ and its entropy rate is

$$H(P) = H_P(X_{k+1} \mid X_1^k) = -\sum_{y \in \Sigma^{k-1}} \sum_{a \in \Sigma} \sum_{b \in \Sigma} R(ay) P(ay, yb) \log(P(ay, yb)). \tag{9}$$

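
The entropy rate expression can be evaluated with a short helper. This sketch assumes base-2 logarithms (so the result is in bits) and uses two hypothetical 1-gram models over $\{0, 1\}$ as sanity checks; none of the names below come from the original text.

```python
import math


def entropy_rate(R, P):
    """Evaluate -sum R(x) P(x, x') log2 P(x, x') over nonzero transitions.

    R maps each state x to its stationary probability and P maps pairs
    (x, x') to transition probabilities; zero-probability terms contribute 0.
    """
    total = 0.0
    for (x, x_next), prob in P.items():
        if prob > 0.0 and R.get(x, 0.0) > 0.0:
            total -= R[x] * prob * math.log2(prob)
    return total


# Hypothetical 1-gram models over the alphabet {0, 1}.
R = {"0": 0.5, "1": 0.5}
fair_coin = {(x, x_next): 0.5 for x in R for x_next in R}
alternating = {("0", "1"): 1.0, ("1", "0"): 1.0}

h_fair = entropy_rate(R, fair_coin)
h_alternating = entropy_rate(R, alternating)
```

A fair coin gives one bit per symbol, while the strictly alternating chain is deterministic and gives zero, even though both have the same first order statistics.
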
B. Extended Solutions 

If $pre(y)$ and $post(y)$ are both strictly greater than 1, then $pre(y) \cdot post(y) > pre(y) +
post(y) - 1$. From the theory of linear programming, there are feasible solutions to the linear
program defined by (6), (7) and (8) which have no more than $pre(y) + post(y) - 1$ nonzero
coordinates, namely the Basic Feasible Solutions [25].

Let such a Basic Feasible Solution be $\hat{P}(ay, yb)$. As derived above, there are solutions with
exactly $pre(y) \cdot post(y)$ nonzero coordinates, namely the Standard Solutions, $P(ay, yb)$. Note
that strict convex combinations of $P$ with $\hat{P}$, $P_u = uP + (1 - u)\hat{P}$ with $0 < u < 1$, define a
continuum of solutions to (6), (7) and (8), with each solution corresponding to an irreducible
Markov Chain. This is the case because every state is reachable from every other state with
nonzero probability due to the construction of the Standard Solution.

Moreover, when $pre(y)$ and $post(y)$ are both strictly greater than 1, $P$ and $\hat{P}$ are different.
As an aside, we have observed that Basic Feasible Solutions typically result in reducible chains
because those solutions involve a minimal number of nonzero transition probabilities.

IV. Numerical Examples

In this section we demonstrate the constructions described above.

1) We consider data generated by the automaton depicted in Figure 3, which is a Hidden Markov
Model (HMM), $M = \{A(0), A(1)\}$, defined by the two transition matrices

$$A(0) = \begin{bmatrix} 0.5 & 0.5 \\ 0 & 0.5 \end{bmatrix}, \qquad A(1) = \begin{bmatrix} 0 & 0 \\ 0.5 & 0 \end{bmatrix}.$$

Fig. 3. A two state Hidden Markov Model used to generate data for the example. Transitions between states are labeled with
the emitted symbols and the probabilities that the transitions occur so that 0 | 0.5 means the transition occurs with probability
0.5 and emits the symbol "0."



The stochastic process of observables is not Markovian of any order as can be seen by
the fact that

$$P(y_t = 0 \mid y_{t-k} \ldots y_{t-1} = 0^k) \neq P(y_t = 0 \mid y_{t-k-1} y_{t-k} \ldots y_{t-1} = 1 0^k)$$

for any $k$. Moreover, it can be shown that this process is not equivalent to any Probabilistic
Deterministic Finite State Automaton.

We generated a sequence of 1000 observations by performing a simulation of this HMM, 
starting in state 1. 

2) We set $k = 2$ and computed the statistics $R(xy)$ and $R(xyz)$ by scanning the data sequence
from left to right and computing sample averages as appropriate:

R(00) = 0.513
R(01) = 0.244
R(10) = 0.244
R(11) = 0.000

and

R(000) = 0.338
R(001) = 0.174
R(010) = 0.244
R(011) = 0.000
R(100) = 0.174
R(101) = 0.070
R(110) = 0.000
R(111) = 0.000


Observe that $R(01) = R(10)$, which is a necessary regularity that follows from the
marginalization property:

$$\sum_{a} R(ay) = \sum_{b} R(yb) = R(y).$$

In order to be sure that the estimates verify those consistency conditions we have treated 
the data stream as a circular buffer as described previously. 
3) We built the Standard Solution, $P$, where $P(ay, yb) = R(yb)/R(y)$, and then we computed
a different numerical solution, $\hat{P}$, of the linear program (6), (7) and (8). ($\hat{P}$ is a Basic
Feasible Solution obtained by employing the Matlab linprog function.) The two solutions
are summarized below:

Transition    Standard Solution P    Basic Feasible Solution \hat{P}
(00,00)       0.678                  1.000
(00,01)       0.322                  0.000
(10,00)       0.678                  0.000
(10,01)       0.322                  1.000
(01,10)       1.000                  1.000
(01,11)       0.000                  0.000
(11,10)       1.000                  0.000
(11,11)       0.000                  1.000

Note that the Basic Feasible Solution, $\hat{P}$, has a maximal number of zeros and results in a
reducible chain with three communicating classes, namely {00}, {01, 10} and {11}. By convexity,
$P_u = u \cdot P + (1 - u) \cdot \hat{P}$ is also a solution for any $0 < u < 1$, so that for $u = 0.5$ and
$u = 0.2$ we obtain respectively the following two different 2-gram models:

Transition    P_0.5    P_0.2
(00,00)       0.839    0.936
(00,01)       0.161    0.064
(10,00)       0.339    0.136
(10,01)       0.661    0.864
(01,10)       1.000    1.000
(01,11)       0.000    0.000
(11,10)       0.500    0.200
(11,11)       0.500    0.800


4) Now compare the original 2-order statistics specified by $M$ with the statistics specified by
the two new models, namely $P_{0.5}$ and $P_{0.2}$ as above:

2-gram    R        R_0.5    R_0.2
00        0.513    0.513    0.513
01        0.244    0.244    0.244
10        0.244    0.244    0.244
11        0.000    0.000    0.000

They are numerically identical as expected. Finally we verify that the 3-order statistics
are all different from each other and from the 3-order statistics of the original data, $R$,
previously listed. The columns below give the 3-order statistics under the Standard Solution $P$,
the Basic Feasible Solution $\hat{P}$, and the two mixtures $P_{0.5}$ and $P_{0.2}$:

3-gram    P        \hat{P}    P_0.5    P_0.2
000       0.348    0.513      0.430    0.480
001       0.165    0.000      0.083    0.033
010       0.244    0.244      0.244    0.244
011       0.000    0.000      0.000    0.000
100       0.165    0.000      0.083    0.033
101       0.079    0.244      0.161    0.211
110       0.000    0.000      0.000    0.000
111       0.000    0.000      0.000    0.000

These 3-order statistics are calculated using the relationships

$$R(ayb) = R(ay) \cdot P(ay, yb)$$

for the various $R$, $a$, $y$, $b$. Moreover, the $R_u$ are the same convex combinations as the
various $P_u$'s. This example illustrates the various constructions we have described in
complete generality in the previous section.
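
The 3-order statistics in this example can be recomputed from the 2-gram statistics and the transition probabilities. The sketch below starts from the rounded values of $R$ reported above, so its results agree with the tabulated values only to about three decimal places; the helper names and the dictionary layout are assumptions made for illustration.

```python
# Two-gram statistics R(xy) from the numerical example (k = 2), as rounded.
R = {"00": 0.513, "01": 0.244, "10": 0.244, "11": 0.000}


def standard_solution(R):
    """P[(ay, yb)] = R(yb) / R(y), with R(y) obtained by summing R(yb) over b."""
    P = {}
    for x in R:
        y = x[1:]
        R_y = sum(R[y + b] for b in "01")
        for b in "01":
            P[(x, y + b)] = R[y + b] / R_y if R_y > 0 else 0.0
    return P


# Basic Feasible Solution exactly as tabulated in the example.
P_bfs = {
    ("00", "00"): 1.0, ("00", "01"): 0.0,
    ("10", "00"): 0.0, ("10", "01"): 1.0,
    ("01", "10"): 1.0, ("01", "11"): 0.0,
    ("11", "10"): 0.0, ("11", "11"): 1.0,
}


def mix(u, P, Q):
    """Convex combination P_u = u * P + (1 - u) * Q, transition by transition."""
    return {t: u * P[t] + (1.0 - u) * Q[t] for t in P}


def third_order(R, P):
    """R(ayb) = R(ay) * P(ay, yb) for every transition ay -> yb."""
    return {x + x_next[-1]: R[x] * p for (x, x_next), p in P.items()}


P_std = standard_solution(R)
r3_std = third_order(R, P_std)
r3_bfs = third_order(R, P_bfs)
r3_half = third_order(R, mix(0.5, P_std, P_bfs))
r3_fifth = third_order(R, mix(0.2, P_std, P_bfs))
```

Summing any of these 3-order tables over the last symbol returns the original 2-gram statistics, which is the sense in which all four models are 2-order equivalent.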

V. A Covert Channel Coding Technique

In the previous section, we showed that given observed string frequencies, $R(z)$, $z \in \Sigma^k$, we
can construct multiple Markov Chains, $M$, whose states are the k-grams ($z \in \Sigma^k$), whose transition
probabilities are $P(ay, yb)$, $a \in \Sigma$, $y \in \Sigma^{k-1}$, and whose stationary distributions are precisely
the observed $R$.

We now show how to use such a Markov Chain to encode messages while preserving the 
statistics, R, of the channel. This means that someone monitoring the channel will observe the 
same k-gram statistics in spite of the fact that covert messages can be communicated within that
channel. As noted before, this can be exploited by either attacker or defender. 

Conceptually, the coding concept is the opposite of the classical Shannon Source Coding
Theorem [17] in the sense that traditionally we start with a stochastic source with entropy rate
$H$ that we seek to compress into binary strings whereas in this case we start with a collection of
$2^r$ messages which we wish to efficiently encode using the dynamics and statistics of the given
stochastic process. Because we have to respect the statistics of the channel, the encoding will
typically not compress but expand the number of bits needed. Nonetheless, we still
seek efficiency with respect to observing the channel's k-gram statistics.

In this work, we will assume, for simplicity, that the covert communication channel is noiseless,
noting that the results can be extended to noisy channels in the traditional way. A more thorough
analysis is deferred to a future study in which the Shannon capacity of noisy channels will be
considered.

This construction involves several steps: 

1) Compute the entropy of the irreducible Markov Chain $M$, $H_M$, specified by transition
probabilities, $P_M$, and stationary distribution, $R_M$:

$$H_M = -\sum_{a, y, b} R_M(ay) \, P_M(ay, yb) \log_2 P_M(ay, yb).$$

Note that we construct the Markov Chains to have a given stationary distribution, $R$, so
only $P_M$ is different for the different models.

For the examples developed in the previous section, we have computed: 

Hp = 0.6863 , Hp = , Hp^ ^ = 0.3165 , Hp^^ = 0.5520. 
Note that P is entirely deterministic and so has zero entropy. 
Since we constructed these Markov Chains so that different transitions from a state
correspond to different observables (that is, so that they are DPFAs), knowledge of the initial state of
the Markov Chain results in a one-to-one correspondence between state sequences and
observation sequences. Hence the entropy rates of both the Markov Chain state sequences
and resulting observation sequences are the same.

Let $Y_s$ represent the stochastic process of observations produced by the constructed Markov
Chain, $M$, starting in state $s \in M$. All states in $M$ are recurrent by construction so the
entropy rate of each process is the same and equal to $H_M$.

2) Apply the Shannon-McMillan-Breiman Asymptotic Equipartition Property (AEP) Theorem
[11] to each $Y_s$, showing that for large $n$ there are approximately $2^{nH_M}$ typical
sequences of length $n$ of $Y_s$ and each occurs with probability approximately $(1/2)^{nH_M}$.
Consequently, in order to encode $2^r$ covert messages, say $C_i$ with $1 \le i \le 2^r$, we must
have $r \le nH_M$ or equivalently $n \ge r/H_M$, so $n$ is selected to encode $2^r$ different covert
message sequences accordingly.

3) Construct length $n$ typical sequences of $Y_s$ by starting in state $s$ and then performing a
random walk of length $n$ in $M$ according to the probabilities $P_M$. Such random walks
define observation sequences of length $n$ in $\Sigma^n$. Produce $2^r \le 2^{nH_M}$ unique sequences for
each state $s$, labeling them as $Y_s(i)$ where $1 \le i \le 2^r \le 2^{nH_M}$. (If a random walk produces
a sequence already generated, simply repeat until a novel random walk is produced.)

4) Note that the $k$-gram frequencies of each $z \in \Sigma^k$ within the $Y_s(i)$ approach the original
$R(z)$ as $n \to \infty$ because $R$ is the stationary distribution of the Markov Chain and $Y_s(i)$
is produced by taking a random walk in the chain.

5) For each state, $s$, assign the covert message $C_i$ to $Y_s(i)$. Pick a random initial state $s(0)$
and assign a sequence of covert messages $C_{i_1} C_{i_2} \ldots C_{i_m}$ to

$$Y_{s(0)}(i_1) \, Y_{s(1)}(i_2) \cdots Y_{s(m-1)}(i_m)$$

where $s(j)$ is recursively defined as the state in which $Y_{s(j-1)}(i_j)$ ended.

Because each random walk in the sequence thus constructed starts in the state in which the
previous random walk ended, the concatenated sequence of random walks is also a legal random
walk in the Markov Chain, obeying all the transition probabilities. Moreover, the $k$-gram statistics
in the overall concatenated sequence of $mn$ observations are approximately $R$ and approach $R$
as $n \to \infty$. The encoded sequence is uniquely decodable by the receiver as well.

Fig. 4. The frequencies of the 13 different delays measured from the codeword sequence are in the right graph (horizontal axis: symbols, that is, coded inter-packet delays). This is to
be compared with the left graph, which is from the empirical data as in Figure 2. Someone monitoring the delays would see no
change in the distribution, but a covert channel is present.
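
Steps 3 to 5 and the decoding step can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab implementation used for the experiments: the toy 2-gram chain `P`, the function names, the zero-based message indices, and the assumption that sender and receiver share the codebook and the initial state $s(0)$ are all hypothetical choices made for the example.

```python
import random

# Hypothetical toy chain over 2-gram states; P[s][t] is the probability of s -> t.
P = {
    "00": {"00": 0.5, "01": 0.5},
    "01": {"10": 1.0},
    "10": {"00": 0.5, "01": 0.5},
}


def random_walk(P, start, n, rng):
    """Take n steps from `start`; each step emits the last symbol of the new state."""
    state, emitted = start, []
    for _ in range(n):
        targets = list(P[state])
        weights = [P[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
        emitted.append(state[-1])
    return "".join(emitted)


def build_codebook(P, r, n, rng, max_tries=100_000):
    """Step 3: for every state s, collect 2**r distinct length-n walks Y_s(i)."""
    book = {}
    for s in P:
        words = []
        for _ in range(max_tries):
            if len(words) == 2 ** r:
                break
            w = random_walk(P, s, n, rng)
            if w not in words:  # repeat until a novel walk is produced
                words.append(w)
        if len(words) < 2 ** r:
            raise ValueError("n is too small to find 2**r distinct walks")
        book[s] = words
    return book


def encode(indices, book, s0):
    """Step 5: chain codewords so each walk starts where the previous one ended."""
    k, state, out = len(s0), s0, []
    for i in indices:
        word = book[state][i]
        out.append(word)
        state = (state + word)[-k:]  # the final state is the last k symbols seen
    return "".join(out)


def decode(stream, book, s0, n):
    """Recover the message indices by tracking the same state sequence."""
    k, state, indices = len(s0), s0, []
    for pos in range(0, len(stream), n):
        word = stream[pos:pos + n]
        indices.append(book[state].index(word))
        state = (state + word)[-k:]
    return indices


rng = random.Random(0)
r, n = 3, 12
book = build_codebook(P, r, n, rng)
messages = [5, 0, 7, 2]
stream = encode(messages, book, "00")
recovered = decode(stream, book, "00", n)
```

Because every codeword is itself a walk in the chain and consecutive codewords are joined at a shared state, the transmitted stream never contains a transition that the chain forbids (here, the string `11`).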

To illustrate this construction, consider the example presented in the left of Figure 2 where
we use the 1-order statistics as in equation (2). We take the convex combination (see Section |ll])

$$P = 0.75 \cdot P_b + 0.25 \cdot P_d,$$

which results in an entropy of $H_P \approx 0.004$ as computed from (9). We build a $(2^r, n)$ codebook
as described above with $r = 8$ and $n = \lceil r/H_P \rceil = 1995$. That is, this encodes binary sequences
of length 8 into sequences of 1995 inter-packet delays. We encoded 16 blocks of 8 random source
bits each into $16 \cdot 1995 = 31920$ symbols from the alphabet $\Sigma = \{1, 2, \ldots, 13\}$ which correspond
to the delays in the left graph of Figure 2.
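
The two quantities used here, the entropy rate of a chain and the resulting codeword length $n = \lceil r/H \rceil$, can be computed as in the sketch below. The chain `P`, its stationary distribution `R`, and the function names are hypothetical and are not the chain of this example; the entropy rate is taken to be the standard stationary-weighted sum $-\sum_s R(s) \sum_t P(s,t) \log_2 P(s,t)$. The last two lines only redo the arithmetic on the figures quoted above.

```python
import math


def entropy_rate(R, P):
    """Entropy rate in bits per symbol: -sum_s R(s) sum_t P(s, t) log2 P(s, t)."""
    return -sum(
        R[s] * p * math.log2(p)
        for s, row in P.items()
        for p in row.values()
        if p > 0
    )


def codeword_length(r, H):
    """Codeword length n = ceil(r / H), so that n * H >= r."""
    return math.ceil(r / H)


# Hypothetical toy chain over 2-gram states with stationary distribution R.
R = {"00": 1 / 3, "01": 1 / 3, "10": 1 / 3}
P = {
    "00": {"00": 0.8, "01": 0.2},
    "01": {"10": 1.0},
    "10": {"00": 0.2, "01": 0.8},
}

H = entropy_rate(R, P)       # about 0.48 bits per symbol
n = codeword_length(8, H)    # symbols needed per 8-bit block

# Arithmetic on the figures quoted in the text.
implied_entropy = 8 / 1995   # per-symbol rate implied by r = 8 and n = 1995
total_symbols = 16 * 1995    # 16 blocks of 1995 symbols each
```

The implied rate `8 / 1995` is roughly 0.00401 bits per symbol, which agrees with the reported entropy of about 0.004 to the displayed precision.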

The obtained 1-order statistics of the resulting concatenated codeword of 31920 symbols are

R' = [0.0029 0.0144 0.0734 0.1453 0.3094 0.1295 0.1151 0.1079 0.1007 0.0014],

and are depicted in Figure 4. Note that the empirical frequencies and graphs are identical to those
of the original data to the displayed precision.

This illustrates empirically the effectiveness of the construction described in this paper. Matlab 
code for reproducing these results is available upon request. 
VI. Conclusions and Future Work

This paper has demonstrated that covert channels can exist even when arbitrarily high order 
statistics about a channel are estimated and monitored. The resulting covert channels can be used 
to either exploit or defend the channel and the advantage goes to the party that has the ability 
to estimate the highest order statistics. 

The adversarial nature of this situation falls within the scope of cognitive attacks [26], [27].
It can be described in abstract as follows: the environment (for example, inter-packet delays) is
modeled as a stochastic process $X$ (such as a Hidden Markov Model, Markov Chain or other
formalism). Both the attacker, $A$, and the defender, $D$, monitor the environment through functions
$f_A \in \mathcal{F}$ and $f_D \in \mathcal{F}$ respectively (for example, $f_D(X)$ could be the probability distribution of
$k$-grams produced by $X$).

The attacker guesses $f_D$ and manipulates $X$ in order to produce a new process, $A(X)$, so
that covert communications can be performed while respecting the behavior that the defender
expects; namely, $f_D(X) = f_D(A(X))$.

On the other hand, the defender, by anticipating the attacker's guess of $f_D$, picks a different
$f'_D$ and manipulates $X$ to produce a new process $D(X)$ so that:

1) $f_D(A(D(X))) = f_D(D(X)) = f_D(X) = f_D(A(X))$: the defensive shaping action is
imperceptible to the defender who uses $f_D$;

2) $f'_D(A(D(X))) \neq f'_D(D(X))$: the attacker's action (that is, creation of a covert channel) is
detectable by the defender.

The game consists of attacker and defender guessing and then exploiting each other's monitoring 
strategy and manipulating the environment accordingly. The common objective of the players is 
to alter the environment in a manner that would be imperceptible to the opponent in order to 
perform a secret task (covert communication or covert channel detection). 
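
A deliberately simplified Python illustration of this game follows. It assumes that the monitoring functions are empirical $k$-gram distributions, with the attacker believing the defender watches 1-grams while the defender actually watches 2-grams. The sequences `X` and `A_of_X`, the function names, and the choice of orders are hypothetical, and the defender's own shaping step $D(X)$ is omitted for brevity.

```python
from collections import Counter


def kgram_distribution(seq, k):
    """Empirical distribution of the length-k windows of seq."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    return {z: c / len(windows) for z, c in counts.items()}


def make_monitor(k):
    """Return a monitoring function f that maps a sequence to its k-gram distribution."""
    return lambda seq: kgram_distribution(seq, k)


f_D = make_monitor(1)        # what the attacker guesses the defender uses
f_D_prime = make_monitor(2)  # what the defender actually uses

X = "0011" * 4               # hypothetical unshaped traffic
A_of_X = "0101" * 4          # hypothetical attacker-shaped traffic

passes_guessed_monitor = f_D(X) == f_D(A_of_X)
caught_by_real_monitor = f_D_prime(X) != f_D_prime(A_of_X)
```

Here the shaped sequence matches the original under the guessed monitor but differs under the higher-order one, which is the situation described in condition 2) above.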

This work raises some questions which are deferred to future work. In particular, the following 
directions are worthy of future investigation: 

• Inter-packet delays involve real-world time so the question of stability when shaping the
channel must be considered. That is, packets can be delayed by certain times only if there
are packets in the queue to be delayed. Such queuing aspects of timing
channels and the possibility of jamming them have been studied [28]. Relating this work
to timing channel jamming will be investigated.

• We used a circular buffer in Section IV to numerically estimate $k$-gram statistics so that the
statistics have the required marginalization properties. A single pass, online algorithm for
implementing this circular buffer only requires storing the first and last $k$ symbols of the
data. In the absence of such a buffer, the empirical statistics will not in general obey the
marginalization identities and so some additional processing would be required. The use of
singular value decompositions, non-negative matrix factorizations or other decomposition
methods for imposing the regularity might be worth exploring further as alternatives to the
circular buffer approach.

• In principle, one can attempt to build automata smaller than the Markov Chains we construct.
In particular, Probabilistic Finite Automata (PFA) [29], [30], [31] could implement Markov
Chains based on $k$-grams but using fewer states. Unlike $k$-gram based Markov Chains,
$k$-PSAs have states that are labeled with input sequences of length at most $k$. So they can
be seen as "variable length" $k$-gram Markov Chains. They can be learned efficiently in the
KL-PAC sense [32], [33], [34] and are generally smaller than $k$-gram based Markov chains
(by having fewer states).

• Within the space of possible Markov Chains that realize given $k$-gram statistics, it would
be good to select the "best" chain from the point of maximizing entropy so that the
covert channel coding is as efficient as possible. Our experiments suggest that the so-called
Standard Solutions presented in Section III-A have the largest entropy although we have
not been able to prove that analytically.

• It is reasonable to ask how our results relate to the use of Hidden Markov Models for
modeling traffic, as for example in [3]. It is known that a Hidden Markov Model with $n$
states is completely determined by the $2n$-grams produced by the model so that reproducing
$2n$-gram statistics will result in the same $n$ state Hidden Markov Model [8].
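
The circular-buffer estimate discussed in the directions above can be illustrated with a short Python sketch. It is a batch version written for clarity, not the single pass online algorithm mentioned in the text, and the function names and sample data are hypothetical. It compares wrap-around counting with plain sliding-window counting and checks the marginalization identities $\sum_b N(zb) = N(z) = \sum_a N(az)$ on the raw counts.

```python
from collections import Counter


def circular_kgram_counts(data, k):
    """Count k-grams with wrap-around: the first k-1 symbols are appended to the end."""
    wrapped = data + data[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(len(data)))


def plain_kgram_counts(data, k):
    """Count k-grams with an ordinary sliding window (no wrap-around)."""
    return Counter(data[i:i + k] for i in range(len(data) - k + 1))


def marginalizes(counts_k, counts_k1):
    """Check that summing out the last or the first symbol of the (k+1)-grams gives counts_k."""
    drop_last, drop_first = Counter(), Counter()
    for word, c in counts_k1.items():
        drop_last[word[:-1]] += c
        drop_first[word[1:]] += c
    return drop_last == counts_k and drop_first == counts_k


data = "0010110100"  # hypothetical symbol sequence

circular_ok = marginalizes(circular_kgram_counts(data, 2), circular_kgram_counts(data, 3))
plain_ok = marginalizes(plain_kgram_counts(data, 2), plain_kgram_counts(data, 3))
```

With wrap-around every position contributes exactly one $k$-gram for each $k$, so the identities hold exactly; with a plain window the $k$-gram and $(k+1)$-gram totals differ by one, so they cannot.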